Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences
Authors
Abstract
MOTIVATION: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences.
RESULTS: Our study shows that (1) for whole-sequence similarity/dissimilarity identification, the window size should be as large as possible, but probably not greater than 3000, as restricted by CPU time in practice; (2) for each measure, the optimal word size increases with window size; (3) when the optimal word size is used, SK-LD performance is superior in both simulation and real data analysis; (4) the estimate beta-hat of beta based on SK-LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (5) beta-hat is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays.
AVAILABILITY: The SK-LD algorithm, the estimate beta-hat and the simulation software are implemented in MATLAB code and are available at http://www.stat.ncku.edu.tw/tjwu
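As a rough illustration of the kind of word-based measure the abstract refers to, the sketch below computes a symmetric Kullback-Leibler discrepancy between the k-word frequency distributions of two DNA sequences. It is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the function names `word_frequencies` and `symmetric_kl`, the use of overlapping words, the pseudocount smoothing, and the unnormalized sum form KL(p||q) + KL(q||p) are all choices made here for illustration, and the paper's exact definition and normalization may differ.

```python
from collections import Counter
from itertools import product
from math import log


def word_frequencies(seq, k, pseudocount=1.0):
    """Relative frequencies of all 4**k DNA words of length k in seq.

    Overlapping words are counted. A pseudocount (an assumption of this
    sketch) keeps every frequency positive so the logarithms are defined.
    Words containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_kl(seq_a, seq_b, k=3, pseudocount=1.0):
    """Symmetric KL discrepancy KL(p||q) + KL(q||p) over k-word frequencies.

    The two directed divergences combine into sum((p - q) * log(p / q)).
    """
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)
```

The value is zero for identical word distributions and grows as the distributions diverge. In the abstract's terms, `k` plays the role of the word size, and the sequences passed in would be windows of the chosen window size.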
Similar resources
Optimal Word Sizes for Dissimilarity Measures and Estimation of the Degree of Dissimilarity Between DNA Sequences Running Head: Optimal word size and degree of dissimilarity
Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scal...
Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition.
In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biom...
A comprehensive experimental comparison of the aggregation techniques for face recognition
In face recognition, one of the most important problems to tackle is a large amount of data and the redundancy of information contained in facial images. There are numerous approaches attempting to reduce this redundancy. One of them is information aggregation based on the results of classifiers built on selected facial areas being the most salient regions from the point of view of classificati...
Design of Dissimilarity Measures: A New Dissimilarity Between Species Distribution Areas
In many situations, dissimilarities between objects cannot be measured directly, but have to be constructed from some known characteristics of the objects of interest, e.g. some values on certain variables. From a philosophical point of view, the assumption of the objective existence of a “true” but not directly observable dissimilarity value between two objects is highly questionable. Therefor...
A Normalized Parameter for Similarity/Dissimilarity Characterization of Sequences
Abstract. We propose a normalized parameter for characterization of similarity/dissimilarity of two sequences providing a smoothly varying measure for varying symmetry score. Such a parameter can be used for analysis of experimental data and fitting to a theoretical model, mirror symmetry estimation with respect to a selected or presumed symmetry axis, in particular, in symmetry detection applic...
Journal title:
- Bioinformatics
Volume 21, Issue 22
Pages -
Publication date 2005